Biostatistician technical interview

Analysis of an oncological dataset

Erik De Luca

Summary

Assignment description
Setting the enviroment
Descriptive Analysis
Survival Analysis
Deeper Analysis

Dataset Description

The response variable, SurvTime, is the survival time in days of a lung cancer patient.

The covariates are:

•Cell (type of cancer cell),

•Therapy (type of therapy: standard or test),

•Prior (prior therapy: 0=no, 10=yes),

•Age (age in years),

•DiagTime(time in months from diagnosis to entry into the trial)

•Kps(performance status).

A censoring indicator variable Censor is created from the data, with the value 1 indicating a censored time and the value 0 indicating an event time. Since there are only two types of therapy, an indicator variable, Treatment, is constructed for therapy type, with: value 0 for standard therapy and value 1 for test therapy.

Exercises

what was the maximum survival time for the cell type adeno?
what is the average age of subjects in this study?
which cell type appeared the most during this study?
Calculate descriptive statistics for all numeric variables within this dataset?
Perform a survival analysis to assess the survival time (variable SurvTime)? based on the cancerous cells (var Cell)? Consider applying survival functions/kaplan meier quartiles/cumulative incidence function/cox regression etc.
Perform an appropriate multivariable analysis to analyze the effect of independent variables age on the hazard ratio between the different cancerous cells (var Cell)?

Load Libraries

pacman::p_load(
  tidyverse,  # A set of many useful libraries
  readxl,     # To import the dataset from Excel
  here,       # To avoid problems with file directories
  janitor,    # To clean data in a fast way
  gt,         # Output tables
  gtsummary,  # Output tables for models and survival data
  survival,   # To manage survival data
  ggsurvfit,  # To plot survival analysis
  tidycmprsk  # To fit survival models
)

Import Data

data_imported <- read_xlsx(
  here("Akkodis interview", "Oncology_dataset_for_R.xlsx")
  )

data_cleaned <- data_imported |> 
  clean_names() |> 
  remove_empty() |> 
  remove_constant() 

data_cleaned |> 
  slice_sample(n = 1, by = c(therapy, cell)) |> 
  gt() |> 
  tab_header("Sample of the dataset cleaned",
             "The sample was stratified by the variables therapy and cell")

Sample of the dataset cleaned
The sample was stratified by the variables therapy and cell
obs	therapy	cell	surv_time	kps	diag_time	age	prior	treatment	censor	event
11	Standard	Squamous	42	60	4	81	0	0	0	1
37	Standard	Small	18	30	4	60	10	0	0	1
54	Standard	adeno	95	80	4	34	0	0	0	1
57	Standard	large	216	50	15	52	0	0	0	1
77	Test	Squamous	1	20	21	65	0	1	0	1
96	Test	Small	20	30	9	54	10	1	0	1

what was the maximum survival time for the cell type adeno?

data_cleaned |> 
  filter(
    surv_time == max(surv_time),
    .by = cell
    ) |> 
  select(obs, cell, surv_time, kps, age) |> 
  gt() |> 
  tab_header("Maximum Survival Time by Cell", 
             "Adeno's row is bolded") |> 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = list(
      cells_body(rows = cell == "adeno")
      )
  )

Maximum Survival Time by Cell
Adeno's row is bolded
obs	cell	surv_time	kps	age
44	Small	392	40	68
52	adeno	162	80	64
58	large	553	70	47
70	Squamous	999	90	54

what is the average age of subjects in this study?

data_cleaned |>  
  summarise(
    across(age, list(
      mean = mean,
      median = median,
      sd = ~ round(sd(.)),
      Q1 = ~ quantile(., .25),
      Q3 = ~ quantile(., .75),
      min = min,
      max = max
    ),
    .names = "{fn}"
    )
  ) |> 
  pivot_longer(everything(), names_to = "statistics") |> 
  gt() |>  
  tab_header(
    "Age", 
    "Mean and other position and variance indicators"
    ) |> 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = list(
      cells_body(rows = statistics == "mean")
      )
  )

Age
Mean and other position and variance indicators
statistics	value
mean	57.60
median	62.00
sd	11.00
Q1	50.00
Q3	65.25
min	34.00
max	81.00

which cell type appeared the most during this study?

data_cleaned |> 
  tabyl(cell) |> 
  adorn_pct_formatting(digits = 0) |> 
  gt() |> 
  tab_header("Cell's frequency") |> 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = list(
      cells_body(rows = n == max(n))
      )
  )

Cell's frequency
cell	n	percent
Small	41	41%
Squamous	35	35%
adeno	9	9%
large	15	15%

Calculate descriptive statistics for all numeric variables within this dataset?

data_cleaned |>
  select(-obs) |> 
  summarise(
    across(
      where(is.numeric),
           list(
      mean = mean,
      median = median,
      sd = ~ round(sd(.)),
      Q1 = ~ quantile(., .25),
      Q3 = ~ quantile(., .75),
      min = min,
      max = max
    ),
    .names = "{col}-{fn}"
    )
  ) |> 
  pivot_longer(everything(), names_to = "statistics") |> 
  separate(col = statistics, sep = "-", into = c("column", "statistics")) |> 
  pivot_wider(names_from = statistics, values_from = value) |> 
  gt() |>  
  tab_header("Descriptive Statistics", "All numeric variables")

Calculate descriptive statistics for all numeric variables within this dataset?

Descriptive Statistics
All numeric variables
column	mean	median	sd	Q1	Q3	min	max
surv_time	135.60	93.5	176	21.75	162.00	1	999
kps	58.65	60.0	20	40.00	80.00	20	90
diag_time	8.95	6.0	9	3.00	12.00	1	58
age	57.60	62.0	11	50.00	65.25	34	81
prior	3.10	0.0	5	0.00	10.00	0	10
treatment	0.31	0.0	0	0.00	1.00	0	1
censor	0.09	0.0	0	0.00	0.00	0	1
event	0.92	1.0	0	1.00	1.00	0	1

A Focus on Survival Time

We expect surv_time to follow an exponential distribution

ggstatsplot::gghistostats(data_cleaned, surv_time, binwidth = 70, title = "Histogram of surv_time")

Survival Analysis

Perform a survival analysis to assess the survival time? based on the cancerous cells? Consider applying survival functions/kaplan meier quartiles/cumulative incidence function/cox regression etc.

Kaplan-Meier plot

survfit2(Surv(surv_time, event) ~ cell, data = data_cleaned) |> 
  ggsurvfit(type = "survival") +
  labs(
    title = "Kaplan-Meier plot",
    subtitle = "Survival time based on cancerous cells",
    x = "Days"
  ) +
  scale_x_continuous(breaks = c(0, 100, 250, 500, 1000)) +
  add_risktable(times = c(0, 100, 250, 500, 1000)) +
  geom_hline(yintercept = 0, color = "tomato", linetype = "dashed")

Kaplan-Meier plot

Comparison of Survival Curves

Calculate the Log Rank Test

survdiff(Surv(surv_time, event) ~ cell, data = data_cleaned)

Call:
survdiff(formula = Surv(surv_time, event) ~ cell, data = data_cleaned)

               N Observed Expected (O-E)^2/E (O-E)^2/V
cell=adeno     9        9      5.2      2.78      3.04
cell=large    15       14     19.8      1.70      2.26
cell=Small    41       38     23.8      8.54     12.44
cell=Squamous 35       31     43.2      3.46      7.27

 Chisq= 18.4  on 3 degrees of freedom, p= 4e-04

Cox Regression

adeno is the reference category, so the hazard ratios are relative to it

cox_model <- coxph(Surv(surv_time, event) ~ cell, data = data_cleaned)
summary(cox_model)

Call:
coxph(formula = Surv(surv_time, event) ~ cell, data = data_cleaned)

  n= 100, number of events= 92 

                coef exp(coef) se(coef)      z Pr(>|z|)   
celllarge    -1.0003    0.3678   0.4358 -2.295  0.02172 * 
cellSmall    -0.1062    0.8993   0.3741 -0.284  0.77653   
cellSquamous -1.0385    0.3540   0.3990 -2.603  0.00925 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

             exp(coef) exp(-coef) lower .95 upper .95
celllarge       0.3678      2.719    0.1565    0.8640
cellSmall       0.8993      1.112    0.4320    1.8721
cellSquamous    0.3540      2.825    0.1619    0.7738

Concordance= 0.604  (se = 0.033 )
Likelihood ratio test= 17.32  on 3 df,   p=6e-04
Wald test            = 17.33  on 3 df,   p=6e-04
Score (logrank) test = 18.39  on 3 df,   p=4e-04

Visualize the Coefficients

survminer::ggforest(cox_model)

Multivariable Analysis

Perform an appropriate multivariable analysis to analyze the effect of independent variables age on the hazard ratio between the different cancerous cells (var Cell)?

Cox Regression with `cell` and `age`

multivariable_cox_model <- coxph(Surv(surv_time, event) ~ cell + age, data = data_cleaned)

multivariable_cox_model |> 
  tbl_regression(exponentiate = T) |> 
  add_global_p() |> 
  add_n(location = "level") |> 
  add_nevent(location = "level") |> 
  bold_labels() |> 
  bold_p() |> 
  italicize_levels()

Characteristic	N	Event N	HR	95% CI	p-value
cell					<0.001
adeno	9	9	—	—
large	15	14	0.36	0.15, 0.85
Small	41	38	0.85	0.40, 1.80
Squamous	35	31	0.34	0.15, 0.74
age	100	92	1.01	0.99, 1.03	0.5
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Diagnostic of the model

Testing the proportional hazards assumption for the multivariable Cox regression model

survminer::ggcoxzph(cox.zph(multivariable_cox_model))

Deeper Analysis

data_cleaned |> 
  select(surv_time, event, cell, age,
         therapy, diag_time) |> 
  tbl_uvregression(
    method = coxph,
    y = Surv(surv_time, event),
    exponentiate = T
  ) |> 
  add_global_p() |> 
  add_n(location = "level") |> 
  add_nevent(location = "level") |> 
  bold_labels() |> 
  bold_p() |> 
  italicize_levels()

Characteristic	N	Event N	HR	95% CI	p-value
cell					<0.001
adeno	9	9	—	—
large	15	14	0.37	0.16, 0.86
Small	41	38	0.90	0.43, 1.87
Squamous	35	31	0.35	0.16, 0.77
age	100	92	1.01	0.99, 1.03	0.4
therapy					0.2
Standard	69	64	—	—
Test	31	28	0.72	0.45, 1.16
diag_time	100	92	1.02	1.0, 1.05	0.14
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Just a check

kps (Key Performance Status) was not included because it’s a measure of the patients that we observe during the therapy not before, as age, diag_time or cell
If included, the model results are as follows:

coxph(Surv(surv_time, event) ~ kps, data = data_cleaned)

Call:
coxph(formula = Surv(surv_time, event) ~ kps, data = data_cleaned)

         coef exp(coef)  se(coef)     z        p
kps -0.029414  0.971014  0.006193 -4.75 2.04e-06

Likelihood ratio test=22.31  on 1 df, p=2.319e-06
n= 100, number of events= 92

What About the Therapy Purposed?

data_cleaned |>
  semi_join(data_cleaned |> filter(therapy == "Test"), by = "cell") |> 
  select(therapy, event, surv_time, cell, kps, diag_time) |> 
  tbl_strata(
    strata = cell,
    .tbl_fun = 
      \(x) x |>
      tbl_summary(
        by = therapy,
        type = kps ~ "continuous"
      ) |> 
      add_p()
  )

Characteristic	Small			Squamous
Characteristic	Standard N = 30¹	Test N = 11¹	p-value²	Standard N = 15¹	Test N = 20¹	p-value²
event	28 (93%)	10 (91%)	>0.9	13 (87%)	18 (90%)	>0.9
surv_time	53 (20, 122)	21 (8, 87)	0.080	100 (25, 144)	157 (32, 373)	0.3
kps	60 (40, 70)	40 (30, 70)	0.2	60 (40, 70)	70 (50, 80)	0.3
diag_time	4 (3, 11)	4 (2, 11)	>0.9	9 (5, 11)	7 (3, 13)	0.5
¹ n (%); Median (Q1, Q3)
² Fisher’s exact test; Wilcoxon rank sum test

Biostatistician technical interview

Summary

Dataset Description

Exercises

Load Libraries

Import Data

what was the maximum survival time for the cell type adeno?

what is the average age of subjects in this study?

which cell type appeared the most during this study?

Calculate descriptive statistics for all numeric variables within this dataset?

Calculate descriptive statistics for all numeric variables within this dataset?

A Focus on Survival Time

Survival Analysis

Kaplan-Meier plot

Kaplan-Meier plot

Comparison of Survival Curves

Cox Regression

Visualize the Coefficients

Multivariable Analysis

Cox Regression with `cell` and `age`

Diagnostic of the model

Deeper Analysis

Just a check

What About the Therapy Purposed?

Let’s Plot It!

Thanks

Biostatistician technical interview

Summary

Dataset Description

Exercises

Load Libraries

Import Data

what was the maximum survival time for the cell type adeno?

what is the average age of subjects in this study?

which cell type appeared the most during this study?

Calculate descriptive statistics for all numeric variables within this dataset?

Calculate descriptive statistics for all numeric variables within this dataset?

A Focus on Survival Time

Survival Analysis

Kaplan-Meier plot

Kaplan-Meier plot

Comparison of Survival Curves

Cox Regression

Visualize the Coefficients

Multivariable Analysis

Cox Regression with cell and age

Diagnostic of the model

Deeper Analysis

Just a check

What About the Therapy Purposed?

Let’s Plot It!

Thanks

Cox Regression with `cell` and `age`